Introduction

Preliminary info

This course will give a basic introduction on (untargeted) preprocessing of liquid chromatography coupled to mass spectrometry (LC-MS) (also extendable to GC-MS or LC-MS/MS) based metabolomics. The methods and workflow presented in this course our mainly based on the workflow presented by the xcms package on Bioconductor.

For this course, you will need R version > 4.4.

If you haven’t already, the packages needed to complete today’s course can be installed by running the next code block:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("xcms")
BiocManager::install("MsExperiment")
BiocManager::install("Spectra")
remotes::install_github("https://github.com/wkumler/squallms")
install.packages("RColorBrewer")
install.packages("RaMS")

And load the required packages:

library(xcms)
library(MsExperiment)
library(Spectra)
library(RColorBrewer)
library(squallms)
library(dplyr)
library(ggplot2)
library(SummarizedExperiment)
library(RaMS)

Review of LC-MS data

First, let’s do a quick review of what kind of data we are dealing with. LC-MS data is generated by first separating chemical molecules using LC or GC columns based on certain properties (e.g. polarity). Afterwards the molecules are ionized, and in the MS instrument, they are separated based on their mass-to-charge ratio or m/z and their intensity (abundance) is measured. Thus, molecules are separated in two different dimensions, the retention time dimension (from the LC) and the mass-to-charge dimension (from the MS) making it easier to measure and identify molecules in more complex samples.

Animation by Johannes Rainer & Phillipine Louail https://jorainer.github.io/xcmsTutorials/
Animation by Johannes Rainer & Phillipine Louail https://jorainer.github.io/xcmsTutorials/

The workflow presented by xcms (but also applicable in for example MZmine3 and most untargeted algorithms) is:

  • chromatographic peak detection: process in which chromatographic peaks are identified within a sample (file).
  • alignment: process that adjusts for retention time differences (i.e. possible signal drifts from the LC) between measurements/samples.
  • correspondence: grouping of chromatographic peaks (presumably from the same ion) across samples/files into LC-MS features

Goals of the workshop

  • Learn how to preprocess LC-MS data from open-source formats (such as .mzML) in R using `xcms`` and other open-source packages
  • Evaluate and understand the results of the preprocessing workflow
  • Be able to build analysis workflows specifically tailored to your data

Example dataset

In this course we are using a subset of the data presented in Lloyd-Price et al. (2019). The subset was downloaded via Metaboanalyst (https://new.metaboanalyst.ca/MetaboAnalyst/upload/SpectraUpload.xhtml) and contains 10 spectra (UPLC-Q/E-ESI-, C18) organized into three groups (Healthy, Crohn’s Disease and QC). The data is already available in the open-source .mzML format.

Importing LC-MS data

First, we import our data as MsExperiment using the readMsExperiment() function. We also add some basic metadata which contains info on the sample groups.

# Read the file names from the data folder
files <- list.files("data/",full.names = TRUE,pattern = ".mzML")

# Import the available metdata with the groups
metadata <- read.table("data/trimmed_metadata.txt",header = TRUE)
# Re-order metadata to match the files sequence
metadata <- metadata[order(metadata$Sample),]


#' Import the data of the experiment
ms_data <- readMsExperiment(files, sampleData = metadata)
#Take a look at the object
ms_data
## Object of class MsExperiment 
##  Spectra: MS1 (4461) 
##  Experiment data: 10 sample(s)
##  Sample data links:
##   - spectra: 10 sample(s) to 4461 element(s).

Our object contains 4461 MS1 spectra in 10 samples

Various functions can be used to access specific data from the object:

#acess sample metadata
sampleData(ms_data)
## DataFrame with 10 rows and 3 columns
##         Sample     Disease   spectraOrigin
##    <character> <character>     <character>
## 1     CD-6KUCT     Control C:\\Users\\p...
## 2     CD-77FXR     Control C:\\Users\\p...
## 3     CD-9OS5Y     Control C:\\Users\\p...
## 4     CD-9WOBP     Control C:\\Users\\p...
## 5     HC-9SNJ4     Disease C:\\Users\\p...
## 6     HC-9X47O     Disease C:\\Users\\p...
## 7     HC-AMR37     Disease C:\\Users\\p...
## 8     HC-AUP8B     Disease C:\\Users\\p...
## 9   QC_PREFA02          QC C:\\Users\\p...
## 10  QC_PREFB02          QC C:\\Users\\p...
#access spectra data
spectra(ms_data)
## MSn data (Spectra) with 4461 spectra in a MsBackendMzR backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1   480.036         1
## 2            1   480.303         2
## 3            1   480.571         3
## 4            1   480.838         4
## 5            1   481.106         5
## ...        ...       ...       ...
## 4457         1   598.670       442
## 4458         1   598.997       443
## 4459         1   599.265       444
## 4460         1   599.532       445
## 4461         1   599.800       446
##  ... 33 more variables/columns.
## 
## file(s):
## CD-6KUCT.mzML
## CD-77FXR.mzML
## CD-9OS5Y.mzML
##  ... 7 more files

Centroiding data

MS instruments allow to export data in profile or centroid mode. Profile data contains the signal for all discrete m/z values (and retention times) for which the instrument collected data (see first figure below). MS instruments continuously sample and record signals, therefore a mass peak for a single ion in one spectrum will consist of multiple intensities at discrete m/z values. The process to reduce this distribution of signals to a single representative mass peak (the centroid, see second figure below) is called centroiding.

Profile data
Profile data
Centroid data
Centroid data

In this case, we need to inspect if the data we are analyzing is profile data (and thus still needs to be centroided for use with xcms) or not. Usually, centroiding is performed when converting the data from the proprietary data format to .mzML, but it can also be performed in R (see https://jorainer.github.io/xcmsTutorials/articles/xcms-preprocessing.html#centroiding-of-profile-ms-data).

We can easily check if the data is centroided using the function isCentroided(). Checking in one file should be sufficient.

isCentroided(spectra(ms_data[1]))
##   [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [136] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [151] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [166] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [196] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [211] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [226] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [241] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [256] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [271] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [286] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [316] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [331] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [346] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [376] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [391] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [406] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [421] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [436] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

The function returns TRUE for all spectra in the sample.

Visually, we can check some of the centroided data for a known compound in the data (see suppplementary info of the Lloyd-Price paper). In this workshop, we will be looking at:

Method mz RT (min) Metabolite
C18n 538.3212 8.66 Met - Cholate
#Here we extract the m/z and RT range for our compound in the two QC samples using filterRT & filterMzRange, afterwards we plot.
ms_data[c(9,10)] |>
    filterSpectra(filterMzRange,538.3212 + c(-0.005, 0.005)) |>
    filterSpectra(filterRt,519.6 + c(-10,10)) |>
    plot()

Basic data inspection

We can inspect the total ion chromatogram using chromatogram(). Setting aggregationFun = "sum" allows to calculate the total ion chromatogram (TIC) and aggregationFun = "max" the base peak chromatogram (BPC).

#' Extract and plot a TIC
tic <- chromatogram(ms_data, aggregationFun = "sum")
plot(tic,main = "Total Ion Chromatogram")

We can also create boxplots representing the distribution of the total ion currents per data file. Such plots can be very useful to spot potentially problematic MS runs.

## Generate a color set for the three groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:3], "60")
names(group_colors) <- c("Control", "Disease","QC")

## Get the total ion current by file
tc <- spectra(ms_data) |>
    tic() |>
    split(f = fromFile(ms_data))
boxplot(tc, col = group_colors[sampleData(ms_data)$Disease],
        ylab = "intensity", main = "Total ion current")

Data preprocessing

In this section, we will go through the 3 main steps of LC-MS preprocessing: peak picking, alignment and correspondence (and gap filling). Remember that you should tailor the parameters and settings to your specific dataset in you are refering to this document in the future.

Chromatographic peak detection

Chromatographic peak detection aims to identify peaks along the retention time axis that represent the signal from individual compounds’ ions. This involves identifying and quantifying such signals as shown in the sketch below.

Peak detection, sketch by J.Rainer & P. Louail
Peak detection, sketch by J.Rainer & P. Louail

Within xcms, several peak detection algorithms can be used and accessed by their respective parameter objects: MatchedFilterParam, CentWaveParam and MassifquantParam. However, in this workshop we will focus on the centWave algoritm (see the original publication (Tautenhahn, Böttcher, and Neumann 2008) for more details) as it is most frequently used in LC-MS metabolomics applications.

The different parameters available for the centWave algorithm can be called with CentWaveParam()

CentWaveParam()
## Object of class:  CentWaveParam 
##  Parameters:
##  - ppm: [1] 25
##  - peakwidth: [1] 20 50
##  - snthresh: [1] 10
##  - prefilter: [1]   3 100
##  - mzCenterFun: [1] "wMean"
##  - integrate: [1] 1
##  - mzdiff: [1] -0.001
##  - fitgauss: [1] FALSE
##  - noise: [1] 0
##  - verboseColumns: [1] FALSE
##  - roiList: list()
##  - firstBaselineCheck: [1] TRUE
##  - roiScales: numeric(0)
##  - extendLengthMSW: [1] FALSE
##  - verboseBetaColumns: [1] FALSE

It is strongly discouraged to use these default parameter settings for any of your preprocessing! There are tools available that automatically configure these parameters based on your data (such as IPO and autoTuner), but generally, the best results are obtained by tuning the parameters manually to your data and ions of interest. Before running any peak detection we can first visually inspect the extracted ion chromatogram of e.g. internal standards or compounds known to be present in the samples in order to evaluate and adapt the settings of the peak detection algorithm.

Let’s take another look at our known compound that we inspected earlier:

#define rt & mz ranges
rtr <- c(515, 530)
mzr <- 538.3212 + c(-0.005, 0.005)
## extract the chromatogram & add legend
chr_raw <- chromatogram(ms_data, mz = mzr, rt = rtr)
plot(chr_raw, col = group_colors[chr_raw$Disease],main = "Met - Cholate",xaxt ="n",lwd = 2) 
legend(x = 525, y = 6e5, legend = c("Control","Disease","QC"),lty = 1,col =  group_colors,cex = 1)
axis(1, at = seq(515, 530, by = 1), las=2)

And let’s try and run centWave peak detection with the default settings in this region:

#' Get default centWave parameters
cwp <- CentWaveParam()

#' "dry-run" peak detection on the EIC.
res <- findChromPeaks(chr_raw, param = cwp)
chromPeaks(res)
##      rt rtmin rtmax into intb maxo sn row column

As expected, no peaks are found by the algorithm.

From the raw chromatogram, we can see that we have a 4 chromatographic peaks present in 2 of the control samples, and in 2 of the QC samples, but the peak seems to be absent in the disease samples.

We can see that the peak width is quite small, at its smallest it is about 5 seconds, so of course, running with a default minimum peakwidth of 20, yields no results. In general, the lower and upper peak width should be set to include most of the expected chromatographic peak widths. A good rule of thumb is to set it from about half to about twice the average expected peak width. For the present data set we thus set peakwidth = c(2, 10).

Let’s try again:

#'  centWave parameters
cwp <- CentWaveParam(peakwidth = c(2,10))

#' Peak detection on the EIC.
res <- findChromPeaks(chr_raw, param = cwp)
chromPeaks(res)
##      rt rtmin rtmax into intb maxo sn row column

We still don’t detect any peaks, so let’s delve a little bit deeper into the parameters and find out what we can change to improve this. Remeber that the help pages usually contain all the info you need!

?CentWaveParam
## starting httpd help server ... done

On the help page, we find a detailed description of the different parameters, such as:

  • snthresh numeric(1) defining the signal to noise ratio cutoff.

When using centWave in extracted ion chromatograms, it often fails to estimate the noise level correctly, so it is advised to set the snthresh parameter quite low and later use a higher value when performing peak detection in the complete data range.

Trying again:

#' centWave parameters
cwp <- CentWaveParam(peakwidth = c(2,10),snthresh = 2)

#' Peak detection on the EIC.
res <- findChromPeaks(chr_raw, param = cwp)
chromPeaks(res)
##           rt   rtmin   rtmax    into      intb     maxo sn row column
## [1,] 520.546 518.085 524.338 2506826 1446253.1 821548.2  3   1      3
## [2,] 519.900 517.974 522.307 1212104  329609.7 625115.6  2   1     10

Now we do detect peaks in two of the samples, for each peak we get:"rt" (retention time of the peak apex), "rtmin" (minimal retention time), "rtmax" (maximal retention time), "into" (integrated, original, intensity of the peak), "maxo" (maximum intentity of the peak).

We can also plot the integration areas:

plot(res,  col = group_colors[chr_raw$Disease],main = "Met - Cholate",lwd = 2)

This already looks promising, but a lot of the peaks we can see are still left out. Let’s drop the snthresh parameter to 1.

#' centWave parameters
cwp <- CentWaveParam(peakwidth = c(2,10),snthresh = 1)

#' Peak detection on the EIC.
res <- findChromPeaks(chr_raw, param = cwp)
chromPeaks(res)
##           rt   rtmin   rtmax    into       intb      maxo sn row column
## [1,] 520.540 518.345 522.948 1067415  270295.39 454518.72  2   1      1
## [2,] 520.546 518.085 524.338 2506826 1446253.10 821548.25  3   1      3
## [3,] 520.088 518.215 521.960  144719   31158.82  62271.75  1   1      4
## [4,] 520.451 518.257 523.125 1213843  341236.03 508836.56  2   1      9
## [5,] 519.900 517.974 522.307 1212104  329609.72 625115.56  2   1     10
plot(res,col = group_colors[chr_raw$Disease],main = "Met - Cholate",lwd = 2)

Now we already observe some strange behavior! Notice that 4/5 peaks have a sn > 2 in our latest results, while in the first run with snthresh = 2, only two peaks were integrated. Indeed, the centWave algorithm can behave quite strange, especially in small rt & m/z ranges, so data evaluation afterwards is important. Also, we note that the small peak that is lying in the noise level in the plot, also get’s integrated. To mitigate this, we change the noiseparameter to 80000.

From the help page:numeric(1) allowing to set a minimum intensity required for centroids to be considered in the first analysis step (centroids with intensity < noise are omitted from ROI detection).

#' centWave parameters
cwp <- CentWaveParam(peakwidth = c(2,10),snthresh = 1,noise = 8e4)

#' Peak detection on the EIC.
res <- findChromPeaks(chr_raw, param = cwp)
chromPeaks(res)
##           rt   rtmin   rtmax    into      intb     maxo sn row column
## [1,] 520.540 518.345 522.948 1067415  270295.4 454518.7  2   1      1
## [2,] 520.546 518.085 524.338 2506826 1446253.1 821548.2  3   1      3
## [3,] 520.451 518.257 523.125 1213843  341236.0 508836.6  2   1      9
## [4,] 519.900 517.974 522.307 1212104  329609.7 625115.6  2   1     10
plot(res,col = group_colors[chr_raw$Disease],main = "Met - Cholate",lwd = 2)

With these settings, all the peaks in the EIC are correctly picked by the algorithm.

Another important parameter to define for your data is ppm.

numeric(1) defining the maximal tolerated m/z deviation in consecutive scans in parts per million (ppm) for the initial ROI definition.

This is highly dependent on the instrument you are using and the precision you’re expecting. In the next code block, we will check what the expected ppm deviation is for the ion.

To be a bit quicker, we will only check the QCs:

target <- ms_data[c(9,10)] |>
    filterSpectra(filterMzRange,538.3212 + c(-0.005, 0.005)) |>
    filterSpectra(filterRt,519.6 + c(-10,10)) 
plot(target)

We can observe some scattering of the data points around an m/z of 538.320 in the lower panel of the above plot. This scattering also decreases with increasing signal intensity (as for many MS instruments the precision of the signal increases with the intensity). To quantify the observed differences in m/z values for the signal of the ion we restrict the data to a region with signal for the ion. Below we first subset the data to the first file and then restrict the m/z range to values between 538.319 and 538.321

#' Reduce the data set to signal of the ion
signal <- target[1] |>
    filterSpectra(filterMzRange,c(538.319, 538.321)) |>
    spectra()

We extract the m/z values of the peaks from the consecutive scans and calculate the absolute difference between them.and express these differences in ppm (parts per million) of the average m/z of the peaks.

#' Calculate the difference in m/z values between scans
mz_diff <- signal |>
    mz() |>
    unlist() |>
    diff() |>
    abs()

mz_diff * 1e6 / mean(unlist(mz(signal)))
##        mz        mz        mz        mz        mz        mz        mz        mz 
## 1.4739500 1.8140924 0.9070462 0.6802846 0.1133808 1.0204270 1.4739500 1.0204270 
##        mz        mz        mz        mz        mz        mz        mz        mz 
## 0.6802846 0.2267615 1.5873308 1.1338077 0.7936654 0.4535231 0.4535231 1.0204270 
##        mz        mz 
## 0.1133808 0.5669039

It seems like the data is coming from a highly accurate instrument, as the ppm deviation is between 0 & 2! To accept some variance for other peaks that might me measured less accurately, we will set the ppm of centWave to 5.

It’s recommended to perform the above steps on a few known compounds and/or internal standard to get a good idea of what kind of parameters you should use for our dataset. Keeping our limited time in mind, we will try and see how we fare with our newly chosen parameters (leaving snthresh to the default of 10, to save some time and taking into account that centWave should better estimate the noise in the whole chromatogram). Also, we set verboseBetaColumns = TRUE, this fairly new parameter calculates new quality metrics for all the peaks based on the work by Kumler et al. (2023).

#' Perform peak detection on the full data set
cwp <- CentWaveParam(peakwidth = c(2, 10), ppm = 5,noise = 8e4,verboseBetaColumns = TRUE)
mse <- findChromPeaks(ms_data, param = cwp)
mse
## Object of class XcmsExperiment 
##  Spectra: MS1 (4461) 
##  Experiment data: 10 sample(s)
##  Sample data links:
##   - spectra: 10 sample(s) to 4461 element(s).
##  xcms results:
##   - chromatographic peaks: 9792 in MS level(s): 1

We detected 9792 peaks in the dataset!

Let’s double check if we managed to pick our known compound:be

chromPeaks(mse, mz = c(538.319, 538.321), rt = c(515, 535))
##              mz    mzmin    mzmax      rt   rtmin   rtmax      into      intb
## CP0392 538.3203 538.3198 538.3207 520.540 518.612 522.145  991610.4  991607.1
## CP2508 538.3212 538.3206 538.3214 520.546 518.085 524.338 2493232.7 2181860.4
## CP7607 538.3203 538.3198 538.3210 520.451 518.525 522.590 1142879.3 1142875.5
## CP8919 538.3204 538.3199 538.3215 519.900 518.242 522.040 1164363.0 1153023.0
##            maxo     sn  beta_cor beta_snr sample
## CP0392 454518.7 454518 0.8731539 6.300274      1
## CP2508 821548.2     19 0.9611432 6.846439      3
## CP7607 508836.6 508836 0.8757355 6.297063      9
## CP8919 625115.6     35 0.8294191 6.400441     10
eic <- chromatogram(mse,mz = c(538.319, 538.321), rt = c(515,530))
## Processing chromatographic peaks
plot(eic)

Note: sometimes peaks get wrongly split, for this, xcms has the function refineChromPeaks(), but we won’t discuss this here. See the xcms documentation for more info.

Retention time alignment

While chromatography helps to better discriminate between analytes it is also affected by variances that lead to shifts in retention times between measurement runs. Such differences can usually already be seen in a base peak chromatogram or total ion chromatogram.The alignment step aims to minimize these retention time differences between all samples within an experiment (see below for an illustration).

plot(tic,main = "Total Ion Chromatogram")

Peak alignment, sketch by J.Rainer & P. Louail
Peak alignment, sketch by J.Rainer & P. Louail

There exist a lot of retention time alignment algorithms. The two main methods implemented in xcms are the peakGroups method (C. A. Smith et al. 2006), which aligns the samples based on so called anchor peaks: these peaks are supposed to represent signal from ions expected to be present in most of the samples of an experiment and the method aligns these samples by minimizing the between-sample retention time differences observed for these peaks. The second method is the obiwarp algorithm (Prince and Marcotte 2006), which uses the full m/z and retention time data. It’s possible to perform retention time using info from all the samples, or only a subset of samples. For more information, see the xcms documentation. Here, since we have limited knowledge of the dataset or possible anchor peaks, we choose to use the obiwarp method. In xcms, the alignment can be performed with the adjustRtime() function.

One important parameter to adjust is binSize: , i.e. aggregating of intensity values falling within an m/z bin. Since we are dealing with a high res instrument, we’ll set it at 0.01.

mse <- adjustRtime(mse, param = ObiwarpParam(binSize = 0.01))

We can visually evaluate the alignment by plotting the original tic and the adjusted one, as well as a plot of the differences.

tic_adj <- chromatogram(mse, aggregationFun = "sum", chromPeaks = "none")
par(mfrow = c(3, 1), mar = c(4.5, 4.2, 1, 0.5))
plot(tic)
grid()
plot(tic_adj)
grid()
## Plot also the difference of adjusted to raw retention time.
plotAdjustedRtime(mse)
grid()

Retention time drift certainly wasn’t the worst in this dataset, but still we see some small improvement after performing the alignment step.

Evaluating the impact on our target peak:

par(mfrow = c(2, 1))
## Plot the raw data
plot(chr_raw)
grid()
## Extract the chromatogram from the adjusted object
chr_adj <- chromatogram(mse, rt = rtr, mz = mzr)
## Processing chromatographic peaks
plot(chr_adj)
grid()

When evaluating the RT alignment in our single peak, we have to admit that it isn’t perfect, since the “biggest” peak got shifted a bit too much too the right. In our experience, getting perfect peak alignment for all peaks in a dataset is quite impossible, so one has to try different algorithms and/or subset-based alignment to find results that are sufficiently good for further analysis.

Correspondence

The final step of the LC-MS preprocessing with xcms is the correspondence analysis, in which chromatographic peaks from the same types of ions (compounds) are grouped across samples to form the so called LC-MS features.

Correspondence, sketch by J.Rainer & P. Louail
Correspondence, sketch by J.Rainer & P. Louail

In xcms, two correspondence algorithms can be used: NearestPeaksParam & PeakDensityParam. For our example we use the peak density method. This algorithm iterates through small slices along the m/z dimension and groups, within each slice, chromatographic peaks with similar retention times.

Settings for this algorithm can be best tested and optimized using the plotChromPeakDensity() function on extracted chromatograms. The two main parameters to tune are bw and binSize

Peaks with similar retention time will result in a higher peak density at a certain retention time and are thus grouped together. The grouping depends on the smoothness of the density curve and can be configured with parameter bw.

Furthermore, it makes sense to perform correspondence taking into account sample groups, as we might expect different features in case or control samples. We define our groups using the sampleGroupsparameter.

We can illustrate this using our target peak as an example:

#' Default parameters for peak density; bw = 30
pdp <- PeakDensityParam(sampleGroups = sampleData(mse)$Disease)

#' Test these settings on the extracted slice
plotChromPeakDensity(chr_adj, param = pdp)

For this particular part of the data, the default settings seem to suffice, as our 4 peaks in the 4 samples succesfully are grouped in the same feature (the grey area).It is however advisable to evaluate settings on multiple slices, ideally with signal from more than one compound being present.

Let’s widen to retention time for this slice, and see what happens.

chr_adj_wide <- chromatogram(mse, rt = rtr + c(-20,20), mz = mzr)
## Processing chromatographic peaks
plotChromPeakDensity(chr_adj_wide, param = pdp)

This slice contains signal from several ions resulting in multiple chromatographic peaks along the retention time axis. With the default settings, in particular with bw = 30, all these peaks were however assigned to the same feature (indicated with the grey rectangle). Signal from different ions would thus be treated as a single entity. We repeat the analysis below with a reduced value for parameter bw.

#' Reducing the bandwidth
pdp <- PeakDensityParam(sampleGroups = sampleData(mse)$Disease, bw = 2)
plotChromPeakDensity(chr_adj_wide, param = pdp, col = group_colors[chr_adj_wide$Disease],lwd = 2)

Setting bw = 2 strongly reduced the smoothness of the density curve resulting in a good seperation of the of density peaks and hence a nice grouping of (aligned) chromatographic peaks into separate features. Note that the left peak isn’t grouped into a feature. This can be explained by the parameter minFraction (default = 0.5) which defines the proportion of samples within at least one sample group in which chromatographic peaks need to be identified in order to define a feature. As the peak is only present in 1/4 control samples, and in 1/4 disease samples, it doesn’t get grouped into a feature. Adjusting to e.g. 0.2 will allow the peaks to be grouped into a feature:

# Reducing the minFraction
pdp <- PeakDensityParam(sampleGroups = sampleData(mse)$Disease, bw = 2,minFraction = 0.2)
plotChromPeakDensity(chr_adj_wide, param = pdp, col = group_colors[chr_adj_wide$Disease],lwd = 2)

Finally, binSize defines the m/z widths of the slices along the m/z dimension the algorithm will iterate through. Note that by default a constant m/z width is used, which might however not reflect the m/z-dependent measurement error of some instruments (such as TOF instruments). To address this, the parameter ppm was recently added that allows to generate m/z-dependent bin size.

Since we seem to be dealing with a highly accurate instrument, we will allow a maximal difference of the m/z of chromatographic peaks of binSize = 0.01 & 5ppm.

# Reducing the binSize
pdp <- PeakDensityParam(sampleGroups = sampleData(mse)$Disease, bw = 2,minFraction = 0.5,binSize = 0.01,ppm = 5)
plotChromPeakDensity(chr_adj_wide, param = pdp, col = group_colors[chr_adj_wide$Disease],lwd = 2)

Now that we are satisfied with our settings, we can perform the correspondence on the whole dataset:

mse <- groupChromPeaks(mse, param = pdp)

Inspecting the final result also shows all the steps we performed:

mse
## Object of class XcmsExperiment 
##  Spectra: MS1 (4461) 
##  Experiment data: 10 sample(s)
##  Sample data links:
##   - spectra: 10 sample(s) to 4461 element(s).
##  xcms results:
##   - chromatographic peaks: 9792 in MS level(s): 1 
##   - adjusted retention times: mean absolute difference 0.468 seconds
##   - correspondence results: 1892 features in MS level(s): 1

Finally, just under 1900 features are identified in this dataset.

The results from the correspondence analysis were added to the XcmsExperiment object. These can be extracted with the featureDefinitions() function, that extracts the definition of the LC-MS features and the featureValues() function that extracts the numerical matrix with the feature abundances (in all samples).

#' Definition of the features
featureDefinitions(mse) |>
    head()
##           mzmed    mzmin    mzmax    rtmed    rtmin    rtmax npeaks Control
## FT0001 250.1526 250.1524 250.1530 593.5501 593.2380 594.0818      4       1
## FT0002 253.0503 253.0503 253.0508 502.0092 501.7769 502.8795      3       1
## FT0003 253.1445 253.1442 253.1448 497.4695 495.8700 499.1075     10       4
## FT0004 254.1477 254.1474 254.1481 497.5510 496.1317 499.1075      9       4
## FT0005 255.0662 255.0660 255.0665 510.9348 509.3073 511.7819      5       2
## FT0006 255.1243 255.1243 255.1243 481.1268 481.0861 481.1676      2       2
##        Disease QC      peakidx ms_level
## FT0001       1  2 1923, 47....        1
## FT0002       0  2 2262, 73....        1
## FT0003       4  2 221, 118....        1
## FT0004       3  2 187, 114....        1
## FT0005       1  2 284, 237....        1
## FT0006       0  0     52, 3492        1
featureValues(mse, method = "maxint") |>
    head()
##        CD-6KUCT.mzML CD-77FXR.mzML CD-9OS5Y.mzML CD-9WOBP.mzML HC-9SNJ4.mzML
## FT0001            NA       1169531            NA            NA       3328653
## FT0002            NA            NA      487055.6            NA            NA
## FT0003    11405118.7      12220647    11833744.0      11324074       9851669
## FT0004     1653582.0       1739580     1701726.8       1612661       1452341
## FT0005      973258.2            NA     1007338.0            NA            NA
## FT0006     2015611.3            NA            NA       5181175            NA
##        HC-9X47O.mzML HC-AMR37.mzML HC-AUP8B.mzML QC_PREFA02.mzML
## FT0001            NA            NA            NA        893125.6
## FT0002            NA            NA            NA        496577.6
## FT0003     8067869.2       3599599       9391604       9484452.5
## FT0004     1060389.7            NA       1321638       1461350.1
## FT0005      625210.2            NA            NA       5379881.4
## FT0006            NA            NA            NA              NA
##        QC_PREFB02.mzML
## FT0001        852592.5
## FT0002        501200.8
## FT0003       9803085.4
## FT0004       1477054.0
## FT0005       5446978.2
## FT0006              NA

The feature matrix might, as can also be seen above, contain missing values. These represent features for which no chromatographic peak was identified in one (or more) sample(s). While a number of imputation methods exist to deal with missing values, it might be more advisable to instead rescue signal. xcms provides such gap filling which is explained in the next section.

Gap filling

Missing values in feature matrices from an xcms-based preprocessing represent cases in which, in a particular sample, no chromatographic peak was identified in the m/z - retention time region of the feature. This could either represent a truly missing value (because the ion/compound was not present in that sample) or a failure of the peak detection algorithm to identify a peak (either because the measured signal was too noisy, or too low, or a combination of both).

We’ll inspect FT0005

FT5 <- featureDefinitions(mse)[5,]
chrs <- chromatogram(mse, mz = c(FT5$mzmin,FT5$mzmax), rt = c(FT5$rtmin,FT5$rtmax) + c(-10,10))
## Processing chromatographic peaks
## Processing features
plot(chrs, col = group_colors[chr_adj_wide$Disease], lwd = 2)

Looking close, the signal was likely too noisy or too sparse in some samples (i.e. to few data points to properly detect a peak). In all cases, however, signal from (presumably) the same ion was measured in both samples. Thus, reporting a missing value would not be correct for these. The aim of the gap filling is now to rescue signal for such features by integrating the intensities measured within the feature’s m/z - retention time area in the sample(s) in which no chromatographic peak was detected.

mse <- fillChromPeaks(mse, param = ChromPeakAreaParam())
FT5 <- featureDefinitions(mse)[5,]
chrs <- chromatogram(mse, mz = c(FT5$mzmin,FT5$mzmax), rt = c(FT5$rtmin,FT5$rtmax) + c(-10,10))
## Processing chromatographic peaks
## Processing features
plot(chrs, col = group_colors[chr_adj_wide$Disease], lwd = 2)

Now all signals are correctly integrated!

Reporting results and used parameters

xcms allows to easily report what steps you performed and which settings you used:

processHistory(mse)
## [[1]]
## Object of class "XProcessHistory"
##  type: Peak detection 
##  date: Wed Jul 17 15:23:09 2024 
##  info:  
##  fileIndex: 1,2,3,4,5,6,7,8,9,10 
##  Parameter class: CentWaveParam 
##  MS level(s) 1 
## 
## [[2]]
## Object of class "XProcessHistory"
##  type: Retention time correction 
##  date: Wed Jul 17 15:27:05 2024 
##  info:  
##  fileIndex: 1,2,3,4,5,6,7,8,9,10 
##  Parameter class: ObiwarpParam 
##  MS level(s) 1 
## 
## [[3]]
## Object of class "XProcessHistory"
##  type: Peak grouping 
##  date: Wed Jul 17 15:32:16 2024 
##  info:  
##  fileIndex: 1,2,3,4,5,6,7,8,9,10 
##  Parameter class: PeakDensityParam 
##  MS level(s) 1 
## 
## [[4]]
## Object of class "XProcessHistory"
##  type: Missing peak filling 
##  date: Wed Jul 17 15:35:40 2024 
##  info:  
##  fileIndex: 1,2,3,4,5,6,7,8,9,10 
##  Parameter class: ChromPeakAreaParam 
##  MS level(s) 1
#' Peak detection parameters
processHistory(mse)[[1]]@param
## Object of class:  CentWaveParam 
##  Parameters:
##  - ppm: [1] 5
##  - peakwidth: [1]  2 10
##  - snthresh: [1] 10
##  - prefilter: [1]   3 100
##  - mzCenterFun: [1] "wMean"
##  - integrate: [1] 1
##  - mzdiff: [1] -0.001
##  - fitgauss: [1] FALSE
##  - noise: [1] 80000
##  - verboseColumns: [1] FALSE
##  - roiList: list()
##  - firstBaselineCheck: [1] TRUE
##  - roiScales: numeric(0)
##  - extendLengthMSW: [1] FALSE
##  - verboseBetaColumns: [1] TRUE

Data evaluation

Classic approaches

The filterFeatures() function provides a robust solution for filtering features based on conventional quality assessment criteria. It supports multiple types of filtering, allowing users to tailor the filtering process to their specific needs, all controlled by the filter argument. An example for RSD (or CV) filtering is shown below.

The RsdFilter enable users to filter features based on their relative standard deviation (coefficient of variation) for a specified threshold. It is recommended to base the computation on quality control (QC) samples.

# Set up parameters for RsdFilter
rsd_filter <- RsdFilter(threshold = 0.3,
                        qcIndex = sampleData(mse)$Disease == "QC")

# Apply the filter to faakho object
mse_filtered <- filterFeatures(object = mse, filter = rsd_filter)
## 431 features were removed

Other approaches

Recently, a lot of work on peak data quality has been done by W. Kumler, so we will be using the newly implemented quality metrics in xcms and the package squallms to evaluate the quality of the LC-MS features in our dataset. In this section we will provide a short introduction to exploring quality metrics of the features and peaks identified by xcms.

First, squallms provides a function that turns the XcmsExperiment object into a flat file containing the feature and peak information.

feat_peak_info <- makeXcmsObjFlat(mse) |>
    select(feature, starts_with("mz"), starts_with("rt"), filename, filepath)
## Warning in .local(object, ...): Had to remove feature definitions along with
## the adjusted retention times because of the dependency between them.
feat_peak_info |>
    head()|>
    mutate(filepath = paste0(substr(filepath, 1, 13), "~"))|>
    knitr::kable(caption = "Output from makeXcmsObjFlat.")
Output from makeXcmsObjFlat.
feature mz mzmin mzmax rt rtmin rtmax filename filepath
FT0001 250.1527 250.1526 250.1529 9.906700 9.857667 9.960200 CD-77FXR.mzML C:~
FT0001 250.1530 250.1529 250.1530 9.888500 9.843917 9.937533 HC-9SNJ4.mzML C:~
FT0001 250.1524 250.1521 250.1528 9.902017 9.861900 9.955517 QC_PREFA02.mzML C:~
FT0001 250.1525 250.1522 250.1529 9.892233 9.847650 9.937717 QC_PREFB02.mzML C:~
FT0001 250.1527 250.1521 250.1529 9.902067 9.848567 9.955567 CD-6KUCT.mzML C:~
FT0001 250.1527 250.1521 250.1529 9.877983 9.846783 9.953783 CD-9OS5Y.mzML C:~

Next, we extract the most usefull metrics: similarity to a bell curve (beta_corr) and the signal-to-noise ratio (SNR).

msdata <- grabMSdata(unique(feat_peak_info$filepath), verbosity = 0)
shape_metrics <- extractChromMetrics(feat_peak_info, ms1_data = msdata$MS1)
knitr::kable(head(shape_metrics), caption = "Format of the output from extractChromMetrics")
Format of the output from extractChromMetrics
feature med_mz med_rt med_cor med_snr
FT0001 250.1527 9.908700 0.9158461 6.570717
FT0002 253.0506 8.375929 0.8745649 3.793918
FT0003 253.1445 8.293762 0.9524109 10.623240
FT0004 254.1478 8.294917 0.9430070 7.272384
FT0005 255.0664 8.512167 0.9363874 7.284747
FT0006 255.1244 8.028625 0.6150342 3.620868

We can inspect the values for these metrics for our data:

summary(shape_metrics$med_cor)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.4360  0.6937  0.8454  0.7666  0.9111  0.9879       7
summary(shape_metrics$med_snr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.3087  3.0300  4.4665  4.6844  6.0806 14.4058       7

To get an idea of the importance and meaning of this value, we can plot a random selection of peaks that are corresponding to picked features and plot them together with their assigned metrics.

set.seed(123)
some_random_feats <- shape_metrics |>
    slice_sample(n = 8)
some_random_feats |>
    mutate(mzmin = med_mz - med_mz * 5 / 1e6) |>
    mutate(mzmax = med_mz + med_mz * 5 / 1e6) |>
    mutate(rtmin = med_rt - 0.2) |>
    mutate(rtmax = med_rt + 0.2) |>
    left_join(msdata$MS1, join_by(
        between(y$rt, x$rtmin, x$rtmax),
        between(y$mz, x$mzmin, x$mzmax)
    )) |>
    qplotMS1data(color_col = "med_cor") +
    geom_vline(aes(xintercept = med_rt), color = "red") +
    geom_text(aes(x = Inf, y = Inf, label = paste0("SNR: ", round(med_snr))),
        data = some_random_feats, hjust = 1, vjust = 1, color = "black"
    ) +
    facet_wrap(~feature, scales = "free", nrow = 3) +
    scale_color_continuous(limits = c(0, 1), name = "Peak shape metric") +
    scale_y_continuous(expand = expansion(c(0, 0.25))) +
    theme(legend.position.inside = c(0.82, 0.12), legend.position = "inside") +
    guides(color = guide_colorbar(direction = "horizontal", title.position = "top"))

In the above plot we can clearly see that high quality peak data corresponds to higher scores for both SNR & the peak shape metric. However, defining strict cutoffs to filter data is hard and will require some manual inspection of more peaks. From what we see above, let’s see how peaks look that have an SNR of <2 and a peak shape metric < 0.5

set.seed(123)

bad_peaks <- subset(shape_metrics, med_cor < 0.5 & med_snr < 2) |>
    slice_sample(n = 8)

bad_peaks |>
    mutate(mzmin = med_mz - med_mz * 5 / 1e6) |>
    mutate(mzmax = med_mz + med_mz * 5 / 1e6) |>
    mutate(rtmin = med_rt - 0.2) |>
    mutate(rtmax = med_rt + 0.2) |>
    left_join(msdata$MS1, join_by(
        between(y$rt, x$rtmin, x$rtmax),
        between(y$mz, x$mzmin, x$mzmax)
    )) |>
    qplotMS1data(color_col = "med_cor") +
    geom_vline(aes(xintercept = med_rt), color = "red") +
    geom_text(aes(x = Inf, y = Inf, label = paste0("SNR: ", round(med_snr))),
        data = bad_peaks, hjust = 1, vjust = 1, color = "black"
    ) +
    facet_wrap(~feature, scales = "free", nrow = 3) +
    scale_color_continuous(limits = c(0, 1), name = "Peak shape metric") +
    scale_y_continuous(expand = expansion(c(0, 0.25))) +
    theme(legend.position.inside = c(0.82, 0.12), legend.position = "inside") +
    guides(color = guide_colorbar(direction = "horizontal", title.position = "top"))

We can indeed see that most of these features originate from quite bad quality peaks, although, depending on the strictness of the analyst and/or knowledge about the possible formed ion, lesser quality peak shapes might be tolerated (e.g. FT0356 & FT0760).

Conclusion and further analysis

Further analysis

For downstream analyses, that don’t need access to this MS data anymore, the preprocessing results could be represented equally well using a SummarizedExperiment object, which is Bioconductor’s standard container for large-scale omics data. xcms provides with the quantify() function a convenience function to extract all results from an XcmsExperiment result object and return it as a SummarizedExperiment.

res <- quantify(mse, method = "sum")

After preprocessing, the data could be normalized or scaled to remove any technical variances from the data. While a simple e.g. median scaling could be done with a few lines of R code also more advanced (but not always needed) normalization algorithms are available in e.g. Bioconductor’s preprocessCore package or other packages.

Differential abundance analysis could be performed using the limma package or with any of the other packages or methods available in R.

Many chromatographic peaks (and subsequently also features) in untargeted metabolomics data sets will represent isotopes or also different ions/adducts of the same compound. The CAMERA package aimed to identify and group such features in a data set. Also, the MetaboAnnotation package provides functions to assist in the annotation of features from LC-MS as well as LC-MS/MS experiments.

For more information on general MS data analysis in R or spectra similarity calculations can be found in the RforMassSpectrometry book or in the various workshops/tutorials at SpectraTutorials.

Take-home messages

This workshop provides an introduction to LC-MS metabolomics preprocessing in R using the xcms framework. It is not an exhaustive how-to and/or gold-standard for every data analysis. A lot of information and code here was derived from the documentation of xcms, which we highly recommended to check out in detail at https://bioconductor.org/packages/release/bioc/html/xcms.html and https://jorainer.github.io/xcmsTutorials/index.html.

Some key points:

  • A lot of great info is out there, we didn’t (re)invent the wheel, read the docs!
  • Encountering difficulties or even bugs? Don’t be afraid to reach out and ask questions on e.g. github: https://github.com/sneumann/xcms/issues It’s a great way to learn more and start interacting with the community.
  • Don’t use the default parameters without giving it some thought!
  • R/xcms isn’t the only option, tools such as mzMine3 are equally good, find what floats your boat, but understand what you are doing.